feat: add eval job that runs on CI #1167

simonrosenberg · 2025-11-14T11:25:35Z

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---|---|---|---|
| java | amd64, arm64 | `eclipse-temurin:17-jdk` | Link |
| python | amd64, arm64 | `nikolaik/python-nodejs:python3.12-nodejs22` | Link |
| golang | amd64, arm64 | `golang:1.21-bookworm` | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:898eb1c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-898eb1c-python \
  ghcr.io/openhands/agent-server:898eb1c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:898eb1c-golang-amd64
ghcr.io/openhands/agent-server:898eb1c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:898eb1c-golang-arm64
ghcr.io/openhands/agent-server:898eb1c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:898eb1c-java-amd64
ghcr.io/openhands/agent-server:898eb1c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:898eb1c-java-arm64
ghcr.io/openhands/agent-server:898eb1c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:898eb1c-python-amd64
ghcr.io/openhands/agent-server:898eb1c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:898eb1c-python-arm64
ghcr.io/openhands/agent-server:898eb1c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:898eb1c-golang
ghcr.io/openhands/agent-server:898eb1c-java
ghcr.io/openhands/agent-server:898eb1c-python

About Multi-Architecture Support

• Each variant tag (e.g., 898eb1c-python) is a multi-arch manifest supporting both amd64 and arm64
• Docker automatically pulls the correct architecture for your platform
• Individual architecture tags (e.g., 898eb1c-python-amd64) are also available if needed
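
The tag scheme listed above can be sketched as a small helper. This is only an illustration of the naming pattern seen in the pushed tags; the function name `agent_server_tag` is hypothetical and not part of the repository.

```bash
# Build an image reference from a short commit SHA, a variant, and an
# optional architecture. The helper name is hypothetical; the pattern
# mirrors the tags listed in this comment.
agent_server_tag() {
    local sha="$1" variant="$2" arch="${3:-}"
    local tag="ghcr.io/openhands/agent-server:${sha}-${variant}"
    # Append the architecture only when one is requested.
    if [ -n "$arch" ]; then
        tag="${tag}-${arch}"
    fi
    printf '%s\n' "$tag"
}

agent_server_tag 898eb1c python        # multi-arch manifest tag
agent_server_tag 898eb1c python amd64  # single-architecture tag
```

The first call yields the multi-arch tag that `docker pull` resolves per platform; the second yields the architecture-specific tag.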

.github/workflows/run-eval.yml

.github/scripts/build_eval_targets.py

blacksmith-sh · 2025-11-23T12:51:36Z

[Automatic Post]: It has been a while since there was any activity on this PR. @simonrosenberg, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

simonrosenberg · 2025-11-24T10:01:08Z

@OpenHands /codereview-roasted
Clone the repos https://github.com/OpenHands/evaluation and https://github.com/OpenHands/benchmarks/issues
and read the diff of this PR to understand how it triggers the eval workflow for software-agent-sdk. Then review the code in software-agent-sdk. Suggest any simplifications and report any bugs.

openhands-ai · 2025-11-24T10:01:19Z

I'm on it! simonrosenberg can track my progress at all-hands.dev

openhands-ai · 2025-11-24T10:06:52Z

Code Review Summary: PR #1167 - Eval Workflow Integration

I've completed a comprehensive "Linus-style" code review of the PR that adds evaluation job triggering to the software-agent-sdk CI pipeline. Here's what I found:

How the Workflow Works

The PR adds a GitHub Actions workflow (.github/workflows/run-eval.yml) that:

- Triggers via PR labels (run-eval-1/2/50/100), releases, or manual dispatch
- Validates the triggering user against an allowlist and model stubs against another allowlist
- Builds SWE-Bench Docker images using the OpenHands/benchmarks repo
- Dispatches the actual evaluation to OpenHands/evaluation repo's workflow
- Comments on the PR with evaluation details
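
The label trigger and the user allowlist check described above could look roughly like the sketch below. It is not the workflow's actual code: the function names, the meaning of the number in the label, and the one-login-per-line allowlist format are all assumptions made for illustration.

```bash
# Map a PR label such as "run-eval-50" to its numeric limit.
# Prints the limit for a recognised label, returns 1 otherwise.
label_to_limit() {
    case "$1" in
        run-eval-1|run-eval-2|run-eval-50|run-eval-100)
            # Strip the fixed prefix, leaving only the number.
            printf '%s\n' "${1#run-eval-}"
            ;;
        *)
            return 1
            ;;
    esac
}

# Succeed only if the login is an exact, whole-line match in the file.
# Assumed format: one login per line.
is_allowed() {
    grep -Fxq -- "$1" "$2"
}

# Usage with a throwaway allowlist containing made-up logins.
allowlist="$(mktemp)"
printf '%s\n' alice bob > "$allowlist"
if is_allowed "alice" "$allowlist" && limit="$(label_to_limit "run-eval-50")"; then
    echo "eval limit: ${limit}"
fi
rm -f "$allowlist"
```

Matching whole lines with `grep -Fx` avoids accepting a login that is merely a prefix or substring of an allowed one.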

🔴 CRITICAL ISSUES (Must Fix)

1. Security Risk: Using `eval` command (Lines 204-216)

- The workflow constructs a shell command as a string then uses eval to execute it
- This is dangerous and completely unnecessary since EVAL_LIMIT is always set
- Fix: Remove the conditional flag logic and call the Python script directly with --n-limit "${EVAL_LIMIT}"
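
To illustrate why the direct call is safer, here is a minimal sketch. `build_images` is a hypothetical stand-in for the real build script, and the hostile `EVAL_LIMIT` value is chosen only to show the difference; in the workflow the value would be numeric.

```bash
# Hypothetical stand-in for the image build script:
# prints each argument it receives on its own line.
build_images() {
    printf '%s\n' "$@"
}

# Deliberately hostile value, used only for demonstration.
EVAL_LIMIT='1; echo INJECTED'

# Risky pattern (shown as a comment, not executed): eval re-parses the
# string as shell code, so the text after ';' runs as a second command.
#   CMD="build_images --n-limit ${EVAL_LIMIT}"
#   eval "$CMD"

# Direct call: the quoted expansion is passed as exactly one argument
# and is never interpreted as shell syntax.
build_images --n-limit "${EVAL_LIMIT}"
```

With the direct call, the stand-in receives two arguments, `--n-limit` and the literal string, and nothing after the semicolon is executed.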

2. Bug: Incorrect SHA substring logic (Line 256)

- SDK_SHA_SHORT=${SDK_SHA:0:40} extracts 40 chars, but SHA-1 is already 40 chars, so this does nothing
- GitHub's API accepts full SHAs anyway, making this pointless
- Fix: Remove the substring operation and use $SDK_SHA directly
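
The no-op can be checked directly in bash. The SHA below is an illustrative 40-character value, not a commit from this PR.

```bash
# Illustrative 40-character SHA-1 in hex (not a real commit from this PR).
SDK_SHA="0123456789abcdef0123456789abcdef01234567"

# Taking 40 characters from a 40-character string returns it unchanged.
SDK_SHA_SHORT="${SDK_SHA:0:40}"
if [ "$SDK_SHA_SHORT" = "$SDK_SHA" ]; then
    echo "0:40 is a no-op"
fi

# A genuinely abbreviated form needs a smaller length, for example 7.
echo "${SDK_SHA:0:7}"
```

If a short SHA were actually wanted, only the second form would shorten anything; otherwise `$SDK_SHA` can be used as-is.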

3. Missing Error Handling: Dispatch call (Lines 236-240)

- The curl command to dispatch the evaluation workflow has no error checking
- If it fails, the workflow continues and reports success
- Fix: Check the HTTP response code (should be 204) and fail if the dispatch fails
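
One way to add that check is sketched below, under assumptions. The helper name `check_dispatch_status` is hypothetical, and the `TOKEN`, `PAYLOAD`, and `DISPATCH_URL` variables in the commented call site are placeholders rather than names taken from the workflow.

```bash
# Succeed only when the HTTP status is 204, the code the review says a
# successful dispatch should return. The helper name is hypothetical.
check_dispatch_status() {
    local status="$1"
    if [ "$status" != "204" ]; then
        echo "Dispatch failed with HTTP status ${status}" >&2
        return 1
    fi
    echo "Dispatch accepted"
}

# Sketch of a call site (not executed here; variables are placeholders):
#   status="$(curl -sS -o /dev/null -w '%{http_code}' -X POST \
#       -H "Authorization: Bearer ${TOKEN}" \
#       -H "Accept: application/vnd.github+json" \
#       -d "${PAYLOAD}" "${DISPATCH_URL}")"
#   check_dispatch_status "$status" || exit 1

check_dispatch_status 204
```

Capturing the status with `-w '%{http_code}'` lets the step fail explicitly instead of continuing after a rejected dispatch.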

4. Data Flow Issue: Double checkout (Lines 68-71 + 151-157)

- First checks out the base branch to validate allowlists
- Then checks out the PR head to build images
- This creates confusion about which ref is being evaluated
- Improvement: Clarify this two-step checkout with comments or restructure

⚠️ IMPROVEMENT OPPORTUNITIES

- Complex parameter resolution (Lines 94-149): Single bash script handles 3 trigger types with nested conditionals - works but could be cleaner with better comments
- Model validation complexity (Lines 132-143): Mixes bash string processing with jq - should pick one approach
- Release-to-PR lookup may fail (Lines 254-262): Assumes releases always map to PRs, which isn't guaranteed
- Future-dated model names: The allowlist includes models dated 2025, which may become stale

✅ WHAT'S GOOD

- Using pull_request_target with allowlist validation is the correct security approach
- The overall architecture (build images, then dispatch eval) makes sense
- Proper use of set -euo pipefail in most bash blocks
- Clear separation of allowlist files

VERDICT: ❌ Needs rework before merging

The eval security issue and missing error handling on the dispatch call must be fixed. The SHA substring bug should also be addressed as it serves no purpose and demonstrates a misunderstanding of how Git SHAs work.

No code changes were made per the code review instructions - this is feedback only. The PR author should address these issues before merging.

openhands-ai · 2025-11-24T10:20:09Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- .github/workflows/run-eval.yml

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1167 at branch `add-eval-job-run-on-CI`

Feel free to include any additional details that might help me get this PR into a better state.

add eval job that runs on CI

3d57dee

simonrosenberg self-assigned this Nov 14, 2025

simonrosenberg and others added 14 commits November 14, 2025 12:41

Merge branch 'main' into add-eval-job-run-on-CI

3ab48bf

modify tests.yml so workflow can be triggered

82dd59e

Merge branch 'add-eval-job-run-on-CI' of github.com:OpenHands/softwar…

807475f

…e-agent-sdk into add-eval-job-run-on-CI

Use bot PAT secret for eval workflows

09b1372

Restore tests workflow

28c8318

Proxy tests workflow for eval run

17ce762

Restore tests workflow

1cfca6f

Align manual eval options with OpenHands

243607f

Remove explicit permissions to match OpenHands

4abff94

Checkout same ref as OpenHands

1a02b04

Align eval workflow patterns with OpenHands

912967a

Refine run-eval parameter handling

bf901c1

Align trigger step with OpenHands

a184c86

Document eval target helper

5841f05

simonrosenberg requested a review from xingyaoww November 17, 2025 10:48

simonrosenberg marked this pull request as ready for review November 17, 2025 10:48

Merge branch 'main' into add-eval-job-run-on-CI

5f13d65

enyst reviewed Nov 17, 2025

.github/workflows/run-eval.yml Outdated

enyst reviewed Nov 17, 2025

.github/workflows/run-eval.yml Outdated

enyst reviewed Nov 17, 2025

.github/workflows/run-eval.yml Outdated

enyst reviewed Nov 17, 2025

.github/scripts/build_eval_targets.py Outdated

simonrosenberg marked this pull request as draft November 17, 2025 13:09

simonrosenberg removed the request for review from xingyaoww November 17, 2025 13:09

Align model ids with config names

965e16d

simonrosenberg marked this pull request as ready for review November 17, 2025 13:31

enyst approved these changes Nov 17, 2025

Point to create-branch-v1 workflow

a55cdf4

simonrosenberg added 3 commits November 24, 2025 09:14

feat: build swebench images before triggering eval

9c50946

chore: add simonrosenberg to eval labelers

ae4d33f

chore: remove unused build_eval_targets helper

c0a878c

Merge branch 'main' into add-eval-job-run-on-CI

bc19e19

fix: harden eval workflow dispatch

0ff90f2

chore: fix ref expression in eval workflow

dff52ad

simonrosenberg merged commit 4f0cc50 into main Nov 24, 2025
21 checks passed

simonrosenberg deleted the add-eval-job-run-on-CI branch November 24, 2025 10:28
